[ICLR 2021] CoCon: A Self-Supervised Approach for Controlled Text Generation

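The ICLR 2021 paper "CoCon: A Self-Supervised Approach for Controlled Text Generation" proposes a self-supervised method that uses text to guide text generation, following up on CTRL and PPLM. The authors design a module called CoCon that is inserted into the transformer; structurally, CoCon is the same as a regular transformer encoder.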

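At generation time, let $c$ be the content we want the output to carry (the control input), with length $l_{c}$. A sentence $x$ of length $l$ is split into two parts, $x_{1:t-1}$ and $x_{t:l}$, and we use $c$ and $x_{1:t-1}$ to predict $x_{t:l}$. The procedure is as follows: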

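First, a transformer encoder encodes $c$ and $x_{1:t-1}$ separately, giving the features $h_c$ and $h_{1:t-1}$. Both are then fed into the CoCon module and fused through self-attention: the Key and Value of $h_c$ are concatenated in front of the Key and Value of $h_{1:t-1}$, while the Query is unchanged and still comes from $h_{1:t-1}$: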

$\mathbf{K}^{\prime}=\left[\mathbf{K}^{(\mathbf{c})} ; \mathbf{K}\right], \quad \mathbf{V}^{\prime}=\left[\mathbf{V}^{(\mathbf{c})} ; \mathbf{V}\right], \quad \mathbf{A}=\operatorname{Softmax}\left(\mathbf{Q} \mathbf{K}^{\prime \top}\right) \mathbf{V}^{\prime}=\operatorname{Softmax}(\mathbf{W}) \mathbf{V}^{\prime}$

$\mathbf{h}_{i}^{\prime}=\mathrm{FF}\left(\mathbf{a}_{i}\right), \quad \tilde{\mathbf{o}}_{t}=\operatorname{dec}\left(\left[\mathbf{h}_{: t-2} ; \mathbf{h}_{t-1}^{\prime}\right]\right), \quad p_{\theta, \psi}\left(\tilde{x}_{t} \mid \mathbf{c}, x_{: t-1}\right)=\operatorname{Softmax}\left(\tilde{\mathbf{o}}_{t}\right)$

$\mathbf{h}_{: t-1}^{\prime}=\operatorname{CoCon}\left(\mathbf{h}_{: l_{c}}^{(\mathbf{c})}, \mathbf{h}_{: t-1}\right)$

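After the feed-forward layer, $\mathbf{A}$ becomes the hidden state $h'_{t-1}$ that carries the information from $c$. An ordinary transformer would compute $h_{1:t-1}$ and feed it to the decoder as memory to guide generation. The CoCon step instead concatenates $h_c$ with $h_{1:t-1}$ and passes them through one more transformer encoder to obtain $h'$.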
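
To make the fusion step concrete, here is a minimal pure-Python sketch of the attention formula exactly as written above. It is not the authors' implementation: the function names are made up for illustration, there is a single head, and the $1/\sqrt{d}$ scaling and any causal mask that a real implementation would likely use are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cocon_attention(Q, K, V, K_c, V_c):
    """A = Softmax(Q K'^T) V' with K' = [K^(c); K] and V' = [V^(c); V].

    Q, K, V come from the prompt states h_{:t-1}; K_c, V_c come from the
    content states h^(c). Each argument is a list of equal-length vectors.
    """
    K_prime = K_c + K  # list concatenation = concatenation along the sequence axis
    V_prime = V_c + V
    dim = len(V_prime[0])
    attended = []
    for q in Q:  # queries still come only from the prompt
        scores = [sum(a * b for a, b in zip(q, k)) for k in K_prime]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, V_prime)) for j in range(dim)]
        )
    return attended

# One prompt position, one content position, 2-dimensional vectors.
A = cocon_attention(
    Q=[[1.0, 0.0]],
    K=[[1.0, 0.0]], V=[[0.0, 1.0]],
    K_c=[[1.0, 0.0]], V_c=[[1.0, 0.0]],
)
print(A)
```

In this toy input the content key and the prompt key score the same against the query, so the output is the plain average of the content value and the prompt value. This is the sense in which the prompt position "reads" from $c$.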

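I find the authors' description of the decoding step slightly confusing: $x_t$ is not generated from $h'_{1:t-1}$. Instead, $h'_{t-1}$ is concatenated with $h_{:t-2}$ to generate $x_t$, and it is not clear to me why $h'_{1:t-1}$ is not used directly.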

When there are several desired contents, i.e. multiple $c$, they can all be concatenated together:

$\mathbf{K}^{\prime}=\left[\mathbf{K}^{\left(\mathbf{c}^{1}\right)} \ldots \mathbf{K}^{\left(\mathbf{c}^{N}\right)} ; \mathbf{K}\right], \quad \mathbf{V}^{\prime}=\left[\mathbf{V}^{\left(\mathbf{c}^{1}\right)} \ldots \mathbf{V}^{\left(\mathbf{c}^{N}\right)} ; \mathbf{V}\right], \quad \mathbf{A}=\operatorname{Softmax}\left(\mathbf{Q} \mathbf{K}^{\prime \top}\right) \mathbf{V}^{\prime}$
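
As a small illustration of the multi-content case, the sketch below builds $\mathbf{K}'$ (the same applies to $\mathbf{V}'$) by concatenating the keys of every content input in front of the prompt keys. The helper name and the toy vectors are assumptions for illustration only.

```python
def concat_with_contents(per_content, prompt):
    """Return [X^(c^1) ... X^(c^N); X] as one list of vectors."""
    merged = []
    for states in per_content:  # one list of vectors per content input c^i
        merged.extend(states)
    return merged + prompt

K_prime = concat_with_contents(
    per_content=[[[1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]]],  # c^1 has 1 token, c^2 has 2
    prompt=[[0.5, 0.5]],                                    # prompt has 1 token
)
print(len(K_prime))
```

The attention then runs over $l_{c^1} + \dots + l_{c^N} + (t-1)$ key positions, while the queries are unchanged.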

Training uses four losses. First, a sentence $x$ of length $l$ is split into two parts, $x^a = \{x_1, \ldots, x_{t-1}\}$ and $x^b = \{x_t, \ldots, x_l\}$.

The first is a reconstruction loss: set $c=x^b$, then have the model generate $x^b$ conditioned on $x^a$ and $c$.

$\mathcal{L}_{\text {self }}=-\sum_{i=t}^{l} \log p_{\theta, \psi}\left(x_{i} \mid\left(\mathbf{c}=\mathbf{x}^{b}\right),\left\{x_{1}, \ldots, x_{i-1}\right\}\right)$

The second is the Null Content Loss: with $c=\varnothing$ the model conditions only on $x^a$, which teaches it to generate fluent sentences.

$\mathcal{L}_{\text {null }}=-\sum_{i=t}^{l} \log p_{\theta, \psi}\left(x_{i} \mid(\mathbf{c}=\varnothing),\left\{x_{1}, \ldots, x_{i-1}\right\}\right)$
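
Both $\mathcal{L}_{\text{self}}$ and $\mathcal{L}_{\text{null}}$ are token-level negative log-likelihoods over $x^b$; they differ only in what is passed as $c$. The sketch below computes that quantity from per-step probability tables. The function name, the tokens, and the probabilities are all made up for illustration.

```python
import math

def sequence_nll(step_dists, targets):
    """Return -sum_i log p(x_i | ...) over the target tokens.

    step_dists[i] maps each candidate token to its predicted probability at
    the i-th position of x^b; targets[i] is the true token x_i.
    """
    return -sum(math.log(dist[token]) for dist, token in zip(step_dists, targets))

x_b = ["sat", "down"]

# Made-up predictions when the model is given c = x^b (for L_self) ...
dists_with_content = [{"sat": 0.9, "ran": 0.1}, {"down": 0.8, "away": 0.2}]
# ... and when it is given c = null and only sees x^a (for L_null).
dists_null_content = [{"sat": 0.5, "ran": 0.5}, {"down": 0.4, "away": 0.6}]

loss_self = sequence_nll(dists_with_content, x_b)
loss_null = sequence_nll(dists_null_content, x_b)
print(loss_self, loss_null)
```

In this toy setup the loss is lower when the content is supplied, simply because the made-up probabilities of the true tokens are higher in that case.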

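The third, the cycle loss, is in my view the highlight of the paper, although the cycle idea has probably appeared in plenty of earlier work. The authors pick two sentences $x$ and $x'$, with $\mathbf{x}=\left[\mathbf{x}^{a} ; \mathbf{x}^{b}\right]$ and $\mathbf{x}^{\prime}=\left[\mathbf{x}^{\prime a} ; \mathbf{x}^{\prime b}\right]$, and first have the model generate a sentence from $x'^a$ and $c=x^b$: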

$\mathbf{y}_{\mathbf{x}, \mathbf{x}^{\prime}}=f_{\theta, \psi}\left(\left(\mathbf{c}=\mathbf{x}^{b}\right),\left(\mathbf{p}=\mathbf{x}^{\prime a}\right)\right)$

Then the model generates from $x^a$ and $c=y_{x,x'}$; the cycle loss pushes this output toward $x^b$:

$\mathbf{y}_{\text {cycle }}=f_{\theta, \psi}\left(\left(\mathbf{c}=\mathbf{y}_{\mathbf{x}, \mathbf{x}^{\prime}}\right),\left(\mathbf{p}=\mathbf{x}^{a}\right)\right)$

$\mathcal{L}_{\text {cycle }}=-\sum_{i=t}^{l} \log p_{\theta, \psi}\left(\mathbf{y}_{\text {cycle }}=\mathbf{x}^{b} \mid\left(\mathbf{c}=\mathbf{y}_{\mathbf{x}, \mathbf{x}^{\prime}}\right),\left(\mathbf{p}=\mathbf{x}^{a}\right)\right)$

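This part takes a moment to untangle. When generating $y_{x,x'}$ we set $c=x^b$, so we want $y_{x,x'}$ to contain the information in $x^b$ while continuing fluently from $x'^a$. Next, $y_{\text{cycle}}$ is generated from $c=y_{x,x'}$ and $x^a$, so we want it to contain the information in $y_{x,x'}$ and continue fluently from $x^a$. Since $y_{x,x'}$ carries the information of $x^b$, $y_{\text{cycle}}$ should both contain the information of $x^b$ and continue fluently from $x^a$, which means it should simply be $x^b$. $\mathcal{L}_{\text{cycle}}$ therefore measures how far $y_{\text{cycle}}$ is from $x^b$.

The authors' intuition is that, in practice, a given prompt text $x^a$ admits a great many possible continuations $x^b$, so we give the model a target content $c$ and ask it to generate an $x^b$ that contains the information in $c$ and continues fluently from $x^a$.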
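
The two passes of the cycle can be sketched as plain data flow. This is only an illustration: `generate` is a hypothetical stand-in for $f_{\theta,\psi}$, the toy generator just copies its content input, and `split` (the number of prompt tokens, i.e. $t-1$) is a name introduced here.

```python
def build_cycle_example(x, x_prime, split, generate):
    """Return the (content, prompt, target) triple scored by the cycle loss.

    split is the number of prompt tokens, so x^a = x[:split] and x^b = x[split:].
    generate(content, prompt) is a hypothetical stand-in for f_{theta,psi}.
    """
    x_a, x_b = x[:split], x[split:]
    x_prime_a = x_prime[:split]
    # First pass: y_{x,x'} should carry x^b's content while continuing x'^a.
    y_cross = generate(content=x_b, prompt=x_prime_a)
    # Second pass: condition on c = y_{x,x'} and p = x^a, with x^b as the target.
    return {"content": y_cross, "prompt": x_a, "target": x_b}

def toy_generate(content, prompt):
    """Toy generator that just copies the content tokens (illustration only)."""
    return list(content)

example = build_cycle_example(
    x=["the", "cat", "sat", "down"],
    x_prime=["a", "dog", "ran", "away"],
    split=2,
    generate=toy_generate,
)
print(example)
```

The returned triple is what the cycle loss scores: the likelihood of `target` given `content` and `prompt`.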

The last loss is an adversarial loss, which is widely used in other work; it encourages the generated text to be as close as possible to real text.

$\mathcal{L}_{\mathrm{adv}}=\mathbb{E}_{\mathbf{x}}\left[\log f_{\mathrm{disc}}(\operatorname{enc}(\mathbf{x}))\right]+\mathbb{E}_{\mathbf{y}}\left[\log \left(1-f_{\mathrm{disc}}(\operatorname{enc}(\mathbf{y}))\right)\right]$

The parameters of $f_{\mathrm{disc}}$ are $\phi$:

$\phi^{*}=\underset{\phi}{\arg \max } \mathcal{L}_{\mathrm{adv}}$
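
As a rough numerical illustration of $\mathcal{L}_{\mathrm{adv}}$, the sketch below replaces the expectations with batch means over discriminator outputs. The function name and the scores are made up, and the discriminator itself is not modeled.

```python
import math

def adversarial_loss(d_real, d_generated):
    """L_adv = E_x[log D(x)] + E_y[log(1 - D(y))], with expectations as batch means.

    d_real and d_generated hold discriminator outputs in (0, 1) for real
    text x and generated text y respectively.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    generated_term = sum(math.log(1.0 - p) for p in d_generated) / len(d_generated)
    return real_term + generated_term

# A discriminator that cannot tell real from generated text ...
confused = adversarial_loss(d_real=[0.5, 0.5], d_generated=[0.5, 0.5])
# ... versus one that separates them well.
confident = adversarial_loss(d_real=[0.9, 0.9], d_generated=[0.1, 0.1])
print(confused, confident)
```

The value is larger (closer to zero) for the discriminator that separates the two sets well, which is why $\phi$ is chosen to maximize $\mathcal{L}_{\mathrm{adv}}$ while the generator side tries to push it down.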

Finally, the full model is trained with

$\theta^{*}=\underset{\theta}{\arg \min }\left(\lambda_{\text {self }} \mathcal{L}_{\text {self }}+\lambda_{\text {null }} \mathcal{L}_{\text {null }}+\lambda_{\text {cycle }} \mathcal{L}_{\text {cycle }}+\lambda_{\text {adv }} \mathcal{L}_{\mathrm{adv}}\right)$

where each $\lambda$ controls the weight of the corresponding loss.
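
The combined objective is just a weighted sum, as the short sketch below shows. The loss values and the $\lambda$ weights here are placeholders chosen for easy arithmetic, not the settings used in the paper.

```python
def total_loss(losses, weights):
    """Weighted sum of lambda_k * L_k for k in {self, null, cycle, adv}."""
    return sum(weights[name] * value for name, value in losses.items())

losses = {"self": 2.0, "null": 1.0, "cycle": 0.5, "adv": 0.25}   # made-up loss values
weights = {"self": 1.0, "null": 0.5, "cycle": 2.0, "adv": 1.0}   # placeholder lambdas
print(total_loss(losses, weights))
```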

The experiments are based on GPT-2, and the data are GPT-2-generated sentences released by OpenAI.

Table 2: Content similarity and quality of generated content-conditioned samples.

For the sentiment and topic classifiers, the authors use a classifier trained on a Kaggle dataset; the results show that CoCon's generations control topic and sentiment better.

Table 3: Evaluation of topic-controlled generations. Topic accuracy report ratio of samples that were classified as their target topic.

Table 4: Evaluation of sentiment-controlled generations. Sentiment accuracy report ratio of samples that were classified as their target sentiment.

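It is a little disappointing that the authors have not released the code yet, but the ICLR 2021 decisions only came out recently, so perhaps they will add it later.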
